Legio: fault resiliency for embarrassingly parallel MPI applications

Abstract

Due to the increasing size of HPC machines, the presence of faults is becoming an eventuality that applications must face. Natively, MPI provides no support for execution past the detection of a fault, and this is becoming more and more constraining. With the introduction of ULFM (User Level Fault Mitigation library), MPI has been provided with a possible way to overcome a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind MPI calls, the library is capable of exposing resiliency features in a transparent manner, thus removing any integration effort. Upon a fault, the failed nodes are discarded and the execution continues only with the non-failed ones. A hierarchical implementation of the solution is also proposed to reduce the overhead of the repair process when scaling towards a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and does not limit the scalability properties of MPI. Moreover, we integrated the solution in real-world applications to further prove its robustness by injecting faults.

Similar resources

Embarrassingly Parallel Search

We propose the Embarrassingly Parallel Search, a simple and efficient method for solving constraint programming problems in parallel. We split the initial problem into a huge number of independent subproblems and solve them with available workers, for instance cores of machines. The decomposition into subproblems is computed by selecting a subset of variables and by enumerating the combinations...

Fault Tolerant File Models for MPI-IO Parallel File Systems

Parallelism in file systems is obtained by using several independent server nodes supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in one single node can make the whole system fail. In order to avoid this problem, data must be stored using some kind of redundant technique, so that it can be recovered i...

Memory Debugging of MPI-Parallel Applications in Open MPI

© 2007 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher ment...

Asymptotically Exact, Embarrassingly Parallel MCMC

Communication costs, resulting from synchronization requirements during learning, can greatly slow down many parallel machine learning algorithms. In this paper, we present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication. First, we arbitrarily partition data onto multiple machines. Then, on each machine, a...

Embarrassingly Parallel Variational Inference in Nonconjugate Models

We develop a parallel variational inference (VI) procedure for use in data-distributed settings, where each machine only has access to a subset of data and runs VI independently, without communicating with other machines. This type of “embarrassingly parallel” procedure has recently been developed for MCMC inference algorithms; however, in many cases it is not possible to directly extend this p...

Journal

Journal title: The Journal of Supercomputing

Year: 2021

ISSN: 0920-8542, 1573-0484

DOI: https://doi.org/10.1007/s11227-021-03951-w